The speech that US president Donald Trump gave to the United Nations General Assembly on Sept. 19, 2017 was fascinating for several reasons. For me personally, it was interesting because it included surprisingly complex sentences and statements, uncharacteristic of Trump’s previous talks. It was also intriguing because the speech was quite undiplomatic and fierce, to a point that steered the world toward the brink of thermonuclear war. At least North Korea is genuinely pissed.

As can be expected when a president addresses the UN, several countries were mentioned in the speech, some in a favorable and others in a negative context. I wanted to make an analysis that connects the sentiments of the speech with the countries mentioned, to see how these countries are regarded by the US.

library(magrittr)
library(tidyverse)
library(rvest)
library(tidytext)
library(forcats)
library(wordcloud)
library(wordcloud2)
library(rworldmap)
library(stringr)
library(ggrepel)
# Get speech excerpt
url <- "http://www.politico.com/story/2017/09/19/trump-un-speech-2017-full-text-transcript-242879"
speech_excerpt <- 
        read_html(url) %>% # Download whole homepage
        html_nodes("style~ p+ p , .lazy-load-slot+ p , .fixed-story-third-paragraph+ p , .story-related+ p , p~ p+ p") %>% # Select the required elements (by css selector)
        html_text() %>% # Make it text
        .[-2] %>% # Remove some random homepage text
        gsub("Mr.", "Mr", ., fixed = T) %>% # Make sure that dots in the text will not signify sentences
        gsub("United States of America", "usa", .) %>% # USA has to be preserved as one expression
        gsub("United States", "usa", .) %>% 
        gsub("America", "usa", .) %>% 
        gsub("States", "usa", .) %>% 
        gsub("North Korea", "north_korea", .) %>% # North Korea should be preserved as one word for now
        data_frame(paragraph = .)

# Tokenize by sentence                        
speech_sentences <- 
        speech_excerpt %>% 
        unnest_tokens(sentence, paragraph, token = "sentences")
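The sentence tokenizer splits on sentence-ending punctuation, which is why the dot in "Mr." was removed during cleaning: otherwise every "Mr." would end a sentence. A minimal sketch of the behaviour on a made-up paragraph:

```r
library(dplyr)
library(tibble)
library(tidytext)

toy <- tibble(paragraph = "Mr Trump spoke first. The delegates listened.")

toy %>% 
        unnest_tokens(sentence, paragraph, token = "sentences")
# Two rows: "mr trump spoke first." / "the delegates listened."
```

Note that unnest_tokens() also lower-cases the text by default, which is convenient for joining with the sentiment dictionaries later on.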
        
# Tokenize by word
speech_words <- 
        speech_excerpt %>% 
        unnest_tokens(word, paragraph, token = "words") %>% 
        mutate(word = gsub("_", " ", word)) %>% 
        # Here comes a nasty manual stemming of country names. Sadly, I failed to get satisfactory results on country names with standard stemmers (I tried SnowballC, hunspell, and textstem). I also tried to create a custom dictionary with added country names, to no avail. What am I missing? Anyway, this works.
        mutate(word = word %>% 
                       str_replace_all("'s$","") %>% # Cut 's
                       if_else(. == "iranian", "iran", .) %>% 
                       if_else(. %in% c("usans", "north koreans"), str_replace(., "ns$",""),.) %>% 
                       if_else(. %in% c("usan","syrian","african","cuban","venezuelan"), str_replace(., "n$",""),.)
        )
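The if_else() chain above relies on a magrittr detail: when `.` appears as a top-level argument, the left-hand side is not inserted again as the first argument, so each step receives the full vector. A quick check of the chain on a made-up toy vector:

```r
library(dplyr)
library(stringr)

c("iranian", "usans", "venezuelan", "peace") %>% 
        str_replace_all("'s$", "") %>% 
        if_else(. == "iranian", "iran", .) %>% 
        if_else(. %in% c("usans", "north koreans"), str_replace(., "ns$", ""), .) %>% 
        if_else(. %in% c("usan", "syrian", "african", "cuban", "venezuelan"), str_replace(., "n$", ""), .)
# "iran" "usa" "venezuela" "peace"
```

Words that are not country derivatives ("peace") pass through untouched, which is what lets this run over the whole token column.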

Exploring the text

The following word cloud shows the words mentioned in the speech. Larger words were used more frequently than smaller ones.

speech_words %>% 
        anti_join(stop_words, by = "word") %>% 
        count(word, sort = TRUE) %>% 
        wordcloud2()
        # wordcloud2(figPath = "trump.png") # I wanted to make a word cloud in the shape of Trump's head, but the package has a known bug that prevented me from doing so.

Next, I looked at the most frequent emotional words in the speech - those used at least three times. It turns out that the majority of frequent emotional words had a positive connotation (e.g. prosperity, support, strong). Among the negative words, the most frequent were related to conflict (conflict, confront, etc.).

# Check emotional words that were uttered at least 3 times
speech_words %>% 
        count(word) %>% 
        inner_join(get_sentiments("bing"), by = "word") %>% 
        filter(n >= 3) %>% 
        mutate(n = if_else(sentiment == "negative", -n, n)) %>% 
        ggplot() +
                aes(y = n, x = fct_reorder(word, n), fill = sentiment) +
                geom_col() +
                coord_flip() +
                labs(x = "Word", 
                     y = "Occurrence in speech",
                     title = "Most common words in Trump's 2017/09/19 UN speech by sentiment")

To include the less frequent emotional words as well, the next word cloud shows all emotional words, separated by sentiment.

speech_words %>%
        inner_join(get_sentiments("bing"), by = "word") %>% 
        count(word, sentiment, sort = TRUE) %>%
        spread(sentiment, n, fill = 0L) %>%
        as.data.frame() %>% 
        remove_rownames() %>% 
        column_to_rownames("word") %>% 
        comparison.cloud(colors = c("red", "blue"))
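For reference, comparison.cloud() from the wordcloud package expects a base data frame with one row per word and one column per sentiment, which is what the spread(), as.data.frame(), and column_to_rownames() steps build. A minimal sketch with made-up counts:

```r
library(tibble)
library(tidyr)

tibble(word = c("peace", "war"),
       sentiment = c("positive", "negative"),
       n = c(3L, 2L)) %>% 
        spread(sentiment, n, fill = 0L) %>% 
        as.data.frame() %>% 
        remove_rownames() %>% 
        column_to_rownames("word")
#       negative positive
# peace        0        3
# war          2        0
```

The fill = 0L matters: a word that only ever appears with one sentiment still needs a zero in the other column.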

Let’s look into specific emotions using the nrc sentiment dictionary! This dictionary makes it possible to associate individual words with distinct emotions, not just polarity. The next plot shows the frequency of each emotion in the talk. It seems that the emotion that dominated the speech was trust, followed by fear and anticipation.

speech_words %>%
        inner_join(get_sentiments("nrc"), by = "word") %>% # Use distinct emotion dictionary
        filter(!sentiment %in% c("positive","negative")) %>% # Only look for distinct emotions
        group_by(sentiment) %>% 
        count(sentiment, sort = T) %>% 
        ggplot() +
                aes(x = fct_reorder(sentiment %>% str_to_title, -n), 
                    y = n, 
                    label = n) +
                geom_col() +
                geom_label(vjust = 1) +
                theme_minimal() +
                labs(title = "The occurrence of words linked to distinct emotions in the speech", 
                     x = "Emotion", 
                     y = "Frequency")

Let’s put the sentiments on a map

First, let’s load the world map data, count how many times each country was mentioned, and join the counts to the country coordinates.

# Load map database
map_world <- 
        map_data(map="world") %>% 
        mutate(region = region %>% str_to_lower()) # Make country name lower case to match word

# Calculate mentions of a country, and join geodata
trump_countries <-
        speech_words %>% 
        count(word) %>% 
        right_join(map_world, by = c("word" = "region")) %>% # Match country coordinates to speech
        select(region = word, everything())

# Get country names with the middle of the country coordinates
country_names <- 
        trump_countries %>% 
        drop_na(n) %>%
        group_by(region) %>% 
        summarise(lat = mean(lat),
                  long = mean(long))
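The right_join above only works because the world map's region names, once lower-cased, line up with the normalized tokens from the speech (this is why "United States" was rewritten to "usa" and "North Korea" to a single token). A quick check, assuming the maps package's world database:

```r
library(maps)
library(ggplot2)
library(dplyr)
library(stringr)

map_data(map = "world") %>% 
        mutate(region = str_to_lower(region)) %>% 
        distinct(region) %>% 
        filter(region %in% c("usa", "north korea", "iran", "venezuela"))
# All four regions are present under these lower-case names
```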

Let’s see which countries were mentioned the most. Obviously, the USA! Iran, Venezuela, and North Korea were also mentioned more than five times. Apart from these, most countries come up only a couple of times during the speech.

trump_countries %>% 
        ggplot() +
        aes(map_id = region, 
            x = long, 
            y = lat, 
            label = paste0(region %>% str_to_title(),": ", n)) +
        geom_map(aes(fill = log10(n)), 
                 map = trump_countries) +
        geom_label_repel(data = trump_countries %>% 
                                 drop_na(n) %>% 
                                 group_by(region) %>% 
                                 slice(1), 
                         alpha = .75) +
        scale_fill_gradient(low = "lightblue", 
                            high = "darkblue", 
                            na.value = "grey90") +
        labs(title = "Number of mentions by country", 
             x = "Longitude", 
             y = "Latitude") +
        theme_minimal() +
        theme(legend.position = "none")

Next, I wanted to see how the speech developed over time, and what was the sentiment of the sentences. Moreover, I wanted to include which countries were mentioned in particular parts of the talk.

# Sentiment of each sentence
sentence_sentiment <-
speech_sentences %>% 
        mutate(sentence_num = row_number(),
               sentence_length = str_length(sentence) # Sentence length in characters
        ) %>% 
        unnest_tokens(word, sentence, "words") %>% 
        mutate(word = gsub("_", " ", word)) %>% 
        # The same nasty manual stemming of country names as above (standard stemmers did not give satisfactory results on country names).
        mutate(word = word %>% 
                       str_replace_all("'s$","") %>% # Cut 's
                       if_else(. == "iranian", "iran", .) %>% 
                       if_else(. %in% c("usans", "north koreans"), str_replace(., "ns$",""),.) %>% 
                       if_else(. %in% c("usan","syrian","african","cuban","venezuelan"), str_replace(., "n$",""),.)
        ) %>% 
        left_join(get_sentiments("bing"), by = "word") %>%
        mutate(sentiment_score = case_when(sentiment == "positive" ~ 1,
                                           sentiment == "negative" ~ -1,
                                           is.na(sentiment) ~ NA_real_)) %>%
        group_by(sentence_num) %>%
        summarise(sum_sentiment = sum(sentiment_score, na.rm = T),
                  sentence = paste(word, collapse = " "))
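To make the scoring step concrete: each bing-labelled word contributes +1 or -1, unlabelled words contribute NA (dropped by na.rm = TRUE), so a sentence score is simply its positive count minus its negative count. A minimal sketch on a made-up three-word "sentence":

```r
library(dplyr)
library(tibble)
library(tidytext)

tibble(sentence_num = 1L, word = c("great", "terrible", "delegates")) %>% 
        left_join(get_sentiments("bing"), by = "word") %>% 
        mutate(sentiment_score = case_when(sentiment == "positive" ~ 1,
                                           sentiment == "negative" ~ -1,
                                           is.na(sentiment) ~ NA_real_)) %>% 
        group_by(sentence_num) %>% 
        summarise(sum_sentiment = sum(sentiment_score, na.rm = TRUE))
# sum_sentiment is 0: one positive, one negative, one neutral word
```

This also shows the main limitation of the approach: a sentence with one positive and one negative word looks exactly as neutral as a sentence with no emotional words at all.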

# Which sentence has a country name
country_sentence <- 
        speech_sentences %>% 
        mutate(sentence_num = row_number()) %>% 
        unnest_tokens(word, sentence, "words") %>% 
        mutate(word = gsub("_", " ", word)) %>% 
        right_join(country_names %>% select(region), by = c("word" = "region")) %>% 
        arrange(sentence_num)

# Sentiment for each country
country_sentiment <-         
        sentence_sentiment %>% 
        full_join(country_sentence, by = "sentence_num") %>% 
        select(region = word, sum_sentiment) %>% 
        drop_na() %>% 
        group_by(region) %>% 
        summarise(country_sentiment = sum(sum_sentiment, na.rm = T))

Now let’s look into the text and check how sentence sentiment developed over the course of the speech, and where the countries were mentioned.

sentence_sentiment %>% 
        full_join(country_sentence, by = "sentence_num") %>% 
        mutate(sentiment_type = case_when(sum_sentiment >0 ~ "positive",
                                          sum_sentiment <0 ~ "negative",
                                          sum_sentiment == 0 ~ "neutral") %>% 
                                fct_rev()
                       ) %>% 
        
        ggplot() +
                aes(x = sentence_num, 
                    y = sum_sentiment, 
                    label = word %>% str_to_title()) +
                geom_hline(yintercept = 0, 
                           color = "grey", 
                           linetype = "dashed", 
                           size = 1.2) +
                geom_smooth(span = 0.05, 
                            se = F, 
                            size = 1.2, 
                            color = "black") +
                geom_label_repel(aes(fill = sentiment_type), 
                                 alpha = .8, 
                                 segment.alpha = 0) +
                scale_fill_manual(values = c("green","grey","red")) +
                theme_minimal() +
                labs(x = "Sentence number", 
                     y = "Sentence sentiment", 
                     title = "The summarised sentiment of sentences, and the appearance of country names in the speech \nby sentiment in sentence order",
                     subtitle = "The dashed line signifies neutral sentence sentiment. \nCountry label colors show the direction of the sentiment (positive/negative)") 

First, it is important to note that the sentiment analysis is based on the summarised sentiment of each sentence, which can be misleading. For example, in the middle of the speech, Israel and the US are mentioned in a very negative sentence, but only because that sentence describes how Iran speaks openly about mass murder.

As we can see, Trump started the speech with positive statements and praised the USA. Then North Korea, China, Ukraine, Russia, and Israel were mentioned in a negative context. Then again, several Middle Eastern and African countries were mentioned in a generally positive context. South American countries - such as Cuba, and especially Venezuela - were also scolded here.

So, how about summarising the country sentiments throughout the whole text and plotting them on a map, to see which countries were mentioned in a positive and which in a negative light?

sentiment_map_data <- 
        trump_countries %>% 
        left_join(country_sentiment, by = "region")

sentiment_map_data %>% 
        mutate(country_sentiment = if_else(region == "usa", NA_real_, country_sentiment)) %>% # Exclude US
        ggplot() +
                aes(    map_id = region, 
                        x = long, 
                        y = lat, 
                        label = paste0(region %>% str_to_title(), ": ", country_sentiment)
                        ) +
                geom_map(aes(fill = country_sentiment), 
                         map = trump_countries) +
                scale_fill_gradient(high = "green", 
                                    low = "red", 
                                    na.value = "grey90") +
                geom_label_repel(data = sentiment_map_data %>%
                                         drop_na(n) %>%
                                         group_by(region) %>%
                                         slice(1),
                                         alpha = .5
                                 ) +
                theme_minimal() +
                labs(title = "Sentiment of the sentences where countries were mentioned (USA excluded)", 
                     x = "Longitude", 
                     y = "Latitude")